Alice Gee, ag67642
Introduction
Since I can remember I have been an avid dog lover. Whether it was reading dog fact books or watching Animal Planet shows about dogs, I was often found surrounding myself in things related to this animal. Since I adopted my first dog last year, my love for dogs has only grown. In this project, I will be looking at the relationship between dog breeds, their overall size (i.e. height and weight), and how it correlates with temperament/behaviors. I will also look at variables like puppy prices, first and second group distinction, and intelligence levels.
library(tidyverse)
data1 <- read_csv("AKC_Breed_Info.csv")
data2 <- read_csv("Dog_Breed_Characteristics.csv")
Tidying: Reshaping
# pivot_longer and pivot_wider will be used to reshape the
# datasets in the wrangling section.
Joining/Merging
joined_data <- inner_join(data1, data2, by = c(Breed = "BreedName"))
joined_data %>% nrow
## [1] 107
anti_join(data1, data2, by = c(Breed = "BreedName")) %>% nrow
## [1] 44
anti_join(data2, data1, by = c(BreedName = "Breed")) %>% nrow
## [1] 143
full_join(data1, data2, by = c(Breed = "BreedName")) %>% nrow
## [1] 294
I used inner_join to join the two data sets together based on the breed of dogs. By doing this, I can make direct comparisons between physical and temperament attributes that make up each breed of dog. Moreover, I can compare across the classification of dog breeds to analyze the attributes that make up the classes of dogs. In data1, there were originally 150 observations, all of which were distinct entries. In data2, there were originally 250 observations, all of which were also distinct entries. 44 dog breeds were observed in data1 and not in data2, which 143 dog breeds were observed in data2 and not in data1. Overall, the joined dataset (after using inner join) had 107 distinct entries, with 187 dog breeds that were not included in the joined dataset. I chose to use inner join instead of other join methods to minimize the appearance of NAs during analysis and to minimize the nuanced naming of dog breeds that further distinguish within a breed classification.
Wrangling
# renaming and deleting irrelevant variables
joined_data <- joined_data %>% mutate(Group2 = str_replace_all(Group2,
"Terrier", "Terrier2")) %>% mutate(Group1 = str_replace_all(Group1,
"[ ]", "-"))
joined_data <- joined_data %>% pivot_longer(7:8, names_to = "GroupingNumber",
values_to = "GroupClassification")
joined_data <- joined_data %>% select(-AltBreedName, -MaleWtKg,
-MalaysiaGuardedDog, -MalaysiaProhibitedDog, -MalaysiaPopularity,
-PopularityUS2017)
joined_data <- joined_data %>% rename(LowHeightInches = height_low_inches) %>%
rename(HighHeightInches = height_high_inches) %>% rename(LowWeightLbs = weight_low_lbs) %>%
rename(HighWeightLbs = weight_high_lbs) %>% rename(Temperament = Temperment) %>%
rename(IntelligenceRating = Intelligence)
# selecting relevant temperament
joined_data <- joined_data %>% separate_rows(Temperament, sep = ", ") %>%
mutate(Indicator = 1, .after = Temperament) %>% pivot_wider(names_from = Temperament,
values_from = Indicator)
joined_data <- joined_data %>% select(Breed:GroupClassification,
Alert, Friendly, Affectionate, Loyal, Gentle, Independent,
Protective, Playful)
# creating indicator values for each temperament
joined_data <- joined_data %>% mutate(Alert = ifelse(is.na(Alert),
0, Alert)) %>% mutate(Friendly = ifelse(is.na(Friendly),
0, Friendly)) %>% mutate(Affectionate = ifelse(is.na(Affectionate),
0, Affectionate)) %>% mutate(Loyal = ifelse(is.na(Loyal),
0, Loyal)) %>% mutate(Gentle = ifelse(is.na(Gentle), 0, Gentle)) %>%
mutate(Independent = ifelse(is.na(Independent), 0, Independent)) %>%
mutate(Protective = ifelse(is.na(Protective), 0, Protective)) %>%
mutate(Playful = ifelse(is.na(Playful), 0, Playful))
# tidying data
joined_data <- joined_data %>% pivot_wider(names_from = GroupingNumber,
values_from = GroupClassification) %>% mutate(FirstClassification = Group1,
.after = Watchdog) %>% mutate(SecondClassification = Group2,
.after = FirstClassification) %>% select(-Group1, -Group2)
# editing bad data value (which was entered in centimeters
# instead of inches)
clean_joined_data <- joined_data %>% filter(!is.na(LowHeightInches))
vizsla_data <- clean_joined_data %>% filter(Breed == "Vizsla") %>%
mutate(LowHeightInches = LowHeightInches/2.54) %>% mutate(HighHeightInches = HighHeightInches/2.54)
clean_joined_data <- clean_joined_data %>% filter(Breed != "Vizsla")
clean_joined_data <- rbind(clean_joined_data, vizsla_data)
# creating a variable for average height and weight
clean_joined_data <- clean_joined_data %>% mutate(AvgHeightInches = (LowHeightInches +
HighHeightInches)/2, .after = "Breed") %>% mutate(AvgWeightLbs = (LowWeightLbs +
HighWeightLbs)/2, .after = "HighHeightInches")
# tidying data
clean_joined_data <- clean_joined_data %>% pivot_longer(13:20,
names_to = "Temperament", values_to = "Indicator")
clean_joined_data <- clean_joined_data %>% group_by(Breed, FirstClassification) %>%
arrange(desc(AvgPupPrice))
# insertion of average height based on first classification
# into joined_data
temp <- clean_joined_data %>% group_by(FirstClassification) %>%
summarize(MeanHeight = mean(AvgHeightInches, na.rm = TRUE))
clean_joined_data <- left_join(clean_joined_data, temp)
In the first section, I did preliminary cleaning of the joined dataset by renaming or removing irrelevant variables. Variable names were changed to match the naming convention of overall dataset or to make the variable name more clear. Variables such as popularity or alternative breed names were removed due to low relevance to this analysis. In the second section, I split the temperament column into individual attributes and selected behaviors of interest that were most common amongst the set of dog breeds. The third section is an extension of the second section in which I added indicator values (0 or 1) for each temperament to see which breeds had what temperament. In sections four and seven, I reshaped and tidied the dataset to make it easier to analyze. In section five, I identified the outlier data (i.e. Vizsla) whose height was entered in the wrong units. After conversion from centimeters into inches, I added the Vizsla data values back into the joined dataset. To make analysis across height and weight easier, I combined the minimum and maximum height/weights to create a new column for average heights/weights in section six. For section eight, I created a new variable based on average height across first classification of dog breeds, which will be used for plot #3.
Statistical Analysis for Numerical Variables
# Minimum, maximum, and average height across first
# classification of dog breed
clean_joined_data %>% group_by(FirstClassification) %>% summarize(AvgHeight = mean(AvgHeightInches,
na.rm = TRUE), MinHeight = min(LowHeightInches, na.rm = TRUE),
MaxHeight = max(HighHeightInches, na.rm = TRUE)) %>% knitr::kable()
| Herding |
21.43750 |
16.0 |
27.0 |
| Hound |
21.84211 |
7.0 |
35.0 |
| Non-Sporting |
15.00000 |
9.5 |
23.0 |
| Sporting |
21.60783 |
10.0 |
27.0 |
| Terrier |
15.10714 |
9.0 |
23.0 |
| Toy |
10.07143 |
6.0 |
15.0 |
| Working |
25.69048 |
16.0 |
35.5 |
# Minimum, maximum, and average weight across first
# classification of dog breed
clean_joined_data %>% group_by(FirstClassification) %>% summarize(AvgWeight = mean(AvgWeightLbs,
na.rm = TRUE), MinWeight = min(LowWeightLbs, na.rm = TRUE),
MaxWeight = max(HighWeightLbs, na.rm = TRUE)) %>% knitr::kable()
| Herding |
58.250000 |
29 |
120 |
| Hound |
56.763158 |
16 |
150 |
| Non-Sporting |
28.545455 |
9 |
70 |
| Sporting |
54.361111 |
22 |
85 |
| Terrier |
27.500000 |
10 |
70 |
| Toy |
9.321429 |
2 |
22 |
| Working |
95.523809 |
25 |
190 |
# Average puppy price across both classifications of dog
# breed
clean_joined_data %>% group_by(FirstClassification, SecondClassification) %>%
summarize(AvgPrice = mean(AvgPupPrice, na.rm = TRUE)) %>%
knitr::kable()
| Herding |
Gun |
1100.0000 |
| Herding |
Herding |
1266.6667 |
| Herding |
Sight |
1000.0000 |
| Hound |
Northern |
600.0000 |
| Hound |
Scenthound |
822.2222 |
| Hound |
Sight |
1077.7778 |
| Non-Sporting |
Companion |
1200.0000 |
| Non-Sporting |
Northern |
733.3333 |
| Sporting |
Gun |
1079.4118 |
| Sporting |
Sight |
2250.0000 |
| Terrier |
Terrier2 |
1103.5714 |
| Toy |
Companion |
829.1667 |
| Toy |
Terrier2 |
600.0000 |
| Working |
Companion |
1250.0000 |
| Working |
Guardian |
1380.7692 |
| Working |
Gun |
1675.0000 |
| Working |
Herding |
1125.0000 |
| Working |
Northern |
1150.0000 |
| Working |
Terrier2 |
1500.0000 |
# Standard deviation of puppy price across first
# classification of dog breed
clean_joined_data %>% group_by(FirstClassification) %>% summarize(StdDevPrice = sd(AvgPupPrice,
na.rm = TRUE)) %>% knitr::kable()
| Herding |
294.6615 |
| Hound |
438.6042 |
| Non-Sporting |
660.6732 |
| Sporting |
546.2879 |
| Terrier |
399.3076 |
| Toy |
306.3674 |
| Working |
583.3836 |
# Average and standard deviation of intelligence ratings
# across first classification of dog breed
clean_joined_data %>% group_by(FirstClassification) %>% summarize(AvgIntelligence = mean(IntelligenceRating,
na.rm = TRUE), StdDevIntelligence = sd(IntelligenceRating,
na.rm = TRUE)) %>% knitr::kable()
| Herding |
19.00000 |
14.25051 |
| Hound |
55.43750 |
14.00998 |
| Non-Sporting |
45.40000 |
18.22615 |
| Sporting |
32.12500 |
17.14367 |
| Terrier |
46.00000 |
11.33280 |
| Toy |
47.63636 |
18.86076 |
| Working |
41.46667 |
21.12107 |
# Average watchdog rating across both classifications of dog
# breed
clean_joined_data %>% group_by(FirstClassification, SecondClassification) %>%
summarize(AvgWatchdogScore = mean(Watchdog, na.rm = TRUE)) %>%
arrange(desc(AvgWatchdogScore)) %>% knitr::kable()
| Herding |
Gun |
6.000000 |
| Hound |
Northern |
6.000000 |
| Working |
Companion |
6.000000 |
| Working |
Terrier2 |
6.000000 |
| Working |
Guardian |
5.538462 |
| Herding |
Herding |
5.333333 |
| Working |
Herding |
5.000000 |
| Terrier |
Terrier2 |
3.857143 |
| Working |
Gun |
3.500000 |
| Sporting |
Gun |
3.470588 |
| Non-Sporting |
Northern |
3.333333 |
| Non-Sporting |
Companion |
3.250000 |
| Herding |
Sight |
3.000000 |
| Hound |
Sight |
2.666667 |
| Toy |
Companion |
2.666667 |
| Toy |
Terrier2 |
2.500000 |
| Working |
Northern |
2.500000 |
| Hound |
Scenthound |
2.000000 |
| Sporting |
Sight |
1.000000 |
# Frequency of temperament occurrence (using Indicator)
# across first classification of dog breeds
clean_joined_data %>% group_by(Temperament, FirstClassification) %>%
summarize(Frequency = mean(Indicator)) %>% arrange(desc(Frequency))
## # A tibble: 56 x 3
## # Groups: Temperament [8]
## Temperament FirstClassification Frequency
## <chr> <chr> <dbl>
## 1 Loyal Herding 0.625
## 2 Playful Toy 0.571
## 3 Playful Non-Sporting 0.545
## 4 Affectionate Sporting 0.5
## 5 Alert Herding 0.5
## 6 Alert Toy 0.429
## 7 Friendly Terrier 0.429
## 8 Friendly Toy 0.429
## 9 Loyal Working 0.429
## 10 Protective Herding 0.375
## # … with 46 more rows
Statistical Analysis for Categorical Variables
# Counts for each breed entry/row
clean_joined_data %>% ungroup() %>% count(Breed)
## # A tibble: 104 x 2
## Breed n
## <chr> <int>
## 1 Affenpinscher 8
## 2 Afghan Hound 8
## 3 Akita 8
## 4 American Staffordshire Terrier 8
## 5 American Water Spaniel 8
## 6 Australian Shepherd 8
## 7 Australian Terrier 8
## 8 Basenji 8
## 9 Basset Hound 8
## 10 Beagle 8
## # … with 94 more rows
# Counts for first classification of breed
clean_joined_data %>% ungroup() %>% count(FirstClassification) %>%
knitr::kable()
| Herding |
64 |
| Hound |
152 |
| Non-Sporting |
88 |
| Sporting |
144 |
| Terrier |
112 |
| Toy |
112 |
| Working |
168 |
# Counts for second classification of breed
clean_joined_data %>% ungroup() %>% count(SecondClassification) %>%
knitr::kable()
| Companion |
168 |
| Guardian |
104 |
| Gun |
160 |
| Herding |
64 |
| Northern |
48 |
| Scenthound |
72 |
| Sight |
88 |
| Terrier2 |
136 |
# Counts for temperament occurrence across breeds
clean_joined_data %>% group_by(Temperament) %>% summarize(Frequency = sum(Indicator)) %>%
knitr::kable()
| Affectionate |
34 |
| Alert |
32 |
| Friendly |
32 |
| Gentle |
25 |
| Independent |
19 |
| Loyal |
35 |
| Playful |
27 |
| Protective |
15 |
Overall Statistical Analysis
# Minimum, maximum, and average height
clean_joined_data %>% ungroup() %>% summarize(AvgHeight = mean(AvgHeightInches,
na.rm = TRUE), MinHeight = min(LowHeightInches, na.rm = TRUE),
MaxHeight = max(HighHeightInches, na.rm = TRUE)) %>% knitr::kable()
# Minimum, maximum, and average weight
clean_joined_data %>% ungroup() %>% summarize(AvgWeight = mean(AvgWeightLbs,
na.rm = TRUE), MinWeight = min(LowWeightLbs, na.rm = TRUE),
MaxWeight = max(HighWeightLbs, na.rm = TRUE)) %>% knitr::kable()
# Average and standard deviation of puppy price
clean_joined_data %>% ungroup() %>% summarize(AvgPrice = mean(AvgPupPrice,
na.rm = TRUE), StdDevPrice = sd(AvgPupPrice, na.rm = TRUE)) %>%
knitr::kable()
# Average and standard deviation of intelligence ratings
clean_joined_data %>% ungroup() %>% summarize(AvgIntelligence = mean(IntelligenceRating,
na.rm = TRUE), StdDevIntelligence = sd(IntelligenceRating,
na.rm = TRUE)) %>% knitr::kable()
# Average watchdog rating
clean_joined_data %>% ungroup() %>% summarize(AvgWatchdogScore = mean(Watchdog,
na.rm = TRUE)) %>% knitr::kable()
# Frequency of temperament occurrence (using Indicator)
clean_joined_data %>% group_by(Temperament) %>% summarize(Frequency = mean(Indicator)) %>%
arrange(desc(Frequency)) %>% knitr::kable()
| Loyal |
0.3333333 |
| Affectionate |
0.3238095 |
| Alert |
0.3047619 |
| Friendly |
0.3047619 |
| Playful |
0.2571429 |
| Gentle |
0.2380952 |
| Independent |
0.1809524 |
| Protective |
0.1428571 |
# NA count across numerical and categorical variables
# (excluding AvgHeightInches and AvgWeightLbs)
joined_data %>% pivot_longer(11:18, names_to = "Temperament",
values_to = "Indicator") %>% summarize_all(function(x) sum(is.na(x)))
## # A tibble: 1 x 12
## Breed LowHeightInches HighHeightInches LowWeightLbs HighWeightLbs AvgPupPrice
## <int> <int> <int> <int> <int> <int>
## 1 0 16 16 16 16 0
## # … with 6 more variables: IntelligenceRating <int>, Watchdog <int>,
## # FirstClassification <int>, SecondClassification <int>, Temperament <int>,
## # Indicator <int>
## NOTE: AvgHeightInches and AvgWeightLbs were calculated and
## added to the joined dataset after removing NAs from the
## height/weight columns.
# NA count for all variables after cleaning some NA values.
clean_joined_data %>% ungroup() %>% summarize_all(function(x) sum(is.na(x)))
## # A tibble: 1 x 15
## Breed AvgHeightInches LowHeightInches HighHeightInches AvgWeightLbs
## <int> <int> <int> <int> <int>
## 1 0 0 0 0 0
## # … with 10 more variables: LowWeightLbs <int>, HighWeightLbs <int>,
## # AvgPupPrice <int>, IntelligenceRating <int>, Watchdog <int>,
## # FirstClassification <int>, SecondClassification <int>, Temperament <int>,
## # Indicator <int>, MeanHeight <int>
Visualizing
library(ggplot2)
# Plot 1 Height vs. Weight
clean_joined_data %>% ggplot(aes(x = AvgHeightInches, y = AvgWeightLbs)) +
geom_point(aes(color = FirstClassification)) + geom_smooth(method = "lm") +
scale_x_continuous(breaks = seq(0, 60, 5)) + scale_y_continuous(breaks = seq(0,
200, 25)) + theme(axis.text.x = element_text(hjust = 0.5,
vjust = -0.25)) + ggtitle("Average Height vs. Average Weight across Breeds and 1st Classification") +
xlab("Average Height (inches)") + ylab("Average Weight (lbs)")

clean_joined_data %>% ungroup() %>% summarize(cor(AvgHeightInches,
AvgWeightLbs, use = "pair"))
## # A tibble: 1 x 1
## `cor(AvgHeightInches, AvgWeightLbs, use = "pair")`
## <dbl>
## 1 0.855
This plot shows the regression between the average height vs. the average weight of each dog breed, colored based on their first classification. As seen, there is an positive linear trend between height and weight. As the dog breeds increase in height, their weight also increases, with some outliers being more heavy in proportion to their height. This plot highlights that most general working dogs are larger/heavier than the other categories of dogs, with the exception that there is some overlap with Hounds, Herding, and Sporting dogs at shorter heights. In contrast, non-sporting, terriers, and toy breeds are the smallest breeds of dogs. It is notable that sporting and herding dogs both have a medium sized body and overlapping height to weight proportions. This makes sense as both categories of dogs undergo high physical activity and are tall enough to complete their tasks (but not large enough to be more susceptible to health issues). In general, medium sized dogs typically have a healthier physique and are less prone to joint issues like hip dysplasia (which is prevalent throughout large breeds), and are thus more likely to be chosen for high physical demanding jobs. Overall, this plot illustrates the strong positive correlation between height and weight, while also highlighting the differences in physical differences between the different categories of dog breeds.
Your discussion of plot 1
# Plot 2 First classification vs. Pup Price
clean_joined_data %>% ggplot(aes(x = FirstClassification, y = AvgPupPrice)) +
geom_boxplot() + geom_jitter(alpha = 0.5, aes(color = SecondClassification)) +
scale_y_continuous(breaks = seq(0, 3000, 500)) + theme(axis.text.x = element_text(angle = 45,
hjust = 1)) + ggtitle("Average Puppy Price based on Classification of Breed") +
xlab("First Classification") + ylab("Average Puppy Price ($)")
This plot is a comparison of average puppy price across the first classification of dog breeds. Within each first grouping, there is an illustration of pricing across the second classification of the dog breeds. Overall this plot indicates that the three breeds that are typically less expensive are toy, non-sporting, and hound breeds, with non-sporting having a few expensive puppy prices as outliers. Based on median price, herding and working dogs are the most expensive dogs. Overall, the puppy prices for herding, sporting, terrier, and working breeds are similar when focusing on values above the 3rd quartile range. In terms of variance, herding and toy breeds have the least amount of spread while sporting, working, and non-sporting have the most spread and/or outliers.
# Plot 3 Temperament Frequency vs. First Classification
clean_joined_data %>% ggplot(aes(x = Temperament, fill = Temperament)) +
geom_bar(aes(y = Indicator), stat = "summary", fun = mean) +
geom_point(aes(y = MeanHeight/100, size = MeanHeight)) +
facet_grid(~FirstClassification) + theme(axis.text.x = element_text(angle = 45,
hjust = 1)) + ggtitle("Temperament Frquency across Classification of Dog Breeds") +
xlab("Temperament Traits") + ylab("Frequency of Temperament") +
coord_flip()
This plot highlights the temperament frequency across first classification of dog breeds and illustrates the average height for each classification. By doing this, it is possible to see if height is a contributing factor to temperament, and thus in turn a contributor to overall classification. The most obvious trend based on height is among the toy breeds. In terms of temperament, they are observed to not be protective of their owners and are typically not independent. This most likely is due to their small size, which is significantly smaller than the other groups. While seemingly negative, these dogs were most likely bred for these traits as toy breeds are typically bred for looks and status appeal. These dogs are often carried around by their owners (i.e. on laps, in bags, or simply holding), thus it would be more beneficial for these dogs to be dependent on their owners rather than struggling for independence. For the other groups of dog breeds, excluding herding, these groups have a medium to tall average height and are more well-rounded in terms of temperament frequency. For the most part, these groups contain dogs that have some/all of these traits, except for protectiveness in hounds. These findings are significant especially in regards to working and sporting dogs. Selection of all of these positive traits when breeding gives rise to more balanced and trainable dogs, which is important for task completion and environmental stability (not reactive to distractions). Similarly, herding dogs are bred for a specific task and illustrates those corresponding traits on the graph. Along with a medium-large body size, herding dogs have been selected for protectiveness, loyalty, and alertness. Overall, this plot demonstrates the correspondence between temperament, size, and breed classification, giving insight on how dogs were originally bred for this intended role.